Variable-Span out-of-vocabulary named entity detection

Authors

Wei Chen

Sankaranarayanan Ananthakrishnan

Rohit Prasad

Premkumar Natarajan

Abstract

Out-of-vocabulary named entities (OOV NEs) are always misrecognized by fixed-vocabulary automatic speech recognition (ASR) systems. This has a negative impact on downstream applications such as language understanding and machine translation (MT). Automatic detection of OOV NEs in ASR hypotheses can help mitigate this problem by triggering the use of alternative approaches to acquire and process these NEs. State-of-the-art OOV NE detection typically involves tagging ASR-hypothesized words using a sequence model, such as conditional random fields (CRF), in conjunction with a variety of contextual and ASR-derived features. In this paper, we propose a novel variable-span tagging approach for detecting OOV NEs. Instead of tagging individual words in ASR hypotheses, we directly tag longer spans of consecutive words. The proposed approach outperforms a state-of-the-art CRF tagger on two distinct held-out test sets with different OOV NE distributions. On a 5.1K-word test set rich in OOV NEs, our method achieves a 56.1% detection rate at a 10% false alarm rate (vs. 52.1% for the CRF detector). On a 39.4K-word test set with a natural distribution of OOV NEs, we obtain a 73.0% detection rate at a 10% false alarm rate (vs. 69.5% for the CRF detector). In all cases, OOV NEs are completely unobserved in our training data.
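To make the contrast between word-level and span-level tagging concrete, the sketch below lists the candidate spans of consecutive words that a span-level tagger could score in one ASR hypothesis. It is only an illustration of the idea of "spans of consecutive words"; the abstract does not say how spans are generated or scored, so the function name `enumerate_spans`, the `max_len` cap, and the example sentence are all assumptions made here.

```python
from typing import List, Tuple


def enumerate_spans(words: List[str], max_len: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, for every span of
    1..max_len consecutive words in the hypothesis.

    Hypothetical helper: the length cap and the enumeration order are
    illustrative choices, not taken from the paper.
    """
    spans = []
    for start in range(len(words)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(words):
                break  # span would run past the end of the hypothesis
            spans.append((start, end))
    return spans


# Made-up ASR hypothesis, used only to show the shape of the output.
hypothesis = "i am calling from zanzibar city today".split()
for start, end in enumerate_spans(hypothesis, max_len=2):
    print(start, end, " ".join(hypothesis[start:end]))
```

A word-level tagger assigns one label per word, whereas a span-level tagger would assign a label (OOV NE or not) to candidates such as the two-word span covering indices 4 to 6 in this example.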

Similar resources

THE JOHNS HOPKINS UNIVERSITY Sub-Lexical and Contextual Modeling of Out-of-Vocabulary Words in Speech Recognition

Large vocabulary speech recognition systems fail to recognize words beyond their vocabulary, many of which are information-rich terms, like named entities or foreign words. Hybrid word/sub-word systems solve this problem by adding sub-word units to large vocabulary word-based systems; new words can then be represented by combinations of sub-word units. We present a novel probabilistic model to l...

A combined Approach to Arabic Named Entity recognition Using SVM and Pattern Extracted method applied to Topic Detection

Named Entity Recognition (NER) is a clue task for automatic text processing that is required in a wide variety of applications. NER techniques range from handcrafted rules to machine learning approaches. In this paper, we describe the development and implementation of an Arabic Named Entity Recognition (ANER) system based on a machine learning approach. We used SVM classifier with a set of depen...

A spoken term detection framework for recovering out-of-vocabulary words using the web

Vocabulary restrictions in large vocabulary continuous speech recognition (LVCSR) systems mean that out-of-vocabulary (OOV) words are lost in the output. However, OOV words tend to be information rich terms (often named entities) and their omission from the transcript negatively affects both usability and downstream NLP technologies, such as machine translation or knowledge distillation. We pro...

OOV Sensitive Named-Entity Recognition in Speech

Named Entity Recognition (NER), an information extraction task, is typically applied to spoken documents by cascading a large vocabulary continuous speech recognizer (LVCSR) and a named entity tagger. Recognizing named entities in automatically decoded speech is difficult since LVCSR errors can confuse the tagger. This is especially true of out-of-vocabulary (OOV) words, which are often named e...

Multilingual Language Processing From Bytes

We describe an LSTM-based model which we call Byte-to-Span (BTS) that reads text as bytes and outputs span annotations of the form [start, length, label] where start positions, lengths, and labels are separate entries in our vocabulary. Because we operate directly on Unicode bytes rather than language-specific words or characters, we can analyze text in many languages with a single model. Due to...

Publication date: 2013
